Abstract: Background subtraction is a fundamental low-level 
processing task in numerous computer vision applications. The 
vast majority of algorithms process images on a pixel-by-pixel 
basis, where an independent decision is made for each pixel. 
A general limitation of such processing is that rich contextual 
information is not taken into account. We propose a block-based 
method capable of dealing with noise, illumination variations and 
dynamic backgrounds, while still obtaining smooth contours of 
foreground objects. Specifically, image sequences are analysed on 
an overlapping block-by-block basis. A low-dimensional texture 
descriptor obtained from each block is passed through an 
adaptive classifier cascade, where each stage handles a distinct 
problem. A probabilistic foreground mask generation approach 
then exploits block overlaps to integrate interim block-level 
decisions into final pixel-level foreground segmentation. Unlike 
many pixel-based methods, ad-hoc post-processing of foreground 
masks is not required. Experiments on the difficult Wallflower 
and I2R datasets show that the proposed approach obtains on 
average better results (both qualitatively and quantitatively) than 
several prominent methods. We furthermore propose the use of 
tracking performance as an unbiased approach for assessing 
the practical usefulness of foreground segmentation methods, 
and show that the proposed approach leads to considerable 
improvements in tracking accuracy on the CAVIAR dataset. 

Index Terms: foreground detection, background subtraction, 
segmentation, background modelling, cascade, patch analysis. 

I. Introduction 

One of the fundamental and critical tasks in many computer- 
vision applications is the segmentation of foreground objects 
of interest from an image sequence. The accuracy of seg- 
mentation can significantly affect the overall performance of 
the application employing it — subsequent processing stages 
use only the foreground pixels rather than the entire frame. 
Segmentation is employed in diverse applications such as 
tracking [1], [2], action recognition [3], gait recognition [4], 
anomaly detection [5], [6], content based video coding [7], 
[8], [9], and computational photography [10]. 

In the literature, foreground segmentation algorithms for an 
image sequence (video) are typically based on segmentation 
via background modelling [11], [12], which is also known as 
background subtraction [13], [14]. We note that foreground 
segmentation is also possible via optical flow analysis [15], 
an energy minimisation framework [7], [16] as well as highly 
object specific approaches, such as detection of faces and 
pedestrians [17], [18]. 

Methods based on optical flow are prone to the aperture 
problem [19] and rely on movement — stationary objects are 
not detected. Methods based on energy minimisation require 
user intervention during their initialisation phase. Specifically, 
regions belonging to foreground and background need to 
be explicitly labelled in order to build prior models. This 
requirement can impose severe restrictions in applications 
where multiple foreground objects are entering/exiting the 
scene (eg. in surveillance applications). 

Detection of specific objects is influenced by the training 
data set which should ideally be exhaustive and encompass 
all possible variations/poses of the object — in practice, this 
is hard to achieve. Furthermore, the type of objects to be 
detected must be known a priori. These constraints can make 
object specific approaches unfavourable in certain surveillance 
environments — typically outdoors, where objects of various 
classes can be encountered, including pedestrians, cars, bikes, 
and abandoned baggage. 

In this work^1 we focus on the approach of foreground 
segmentation via background modelling, which can be for- 
mulated as a binary classification problem. Unlike the other 
approaches mentioned above, no constraints are imposed on 
the nature, shape or behaviour of foreground objects appearing 
in the scene. The general approach is as follows. Using a 
training image sequence, a reference model of the back- 
ground is generated. The training sequence preferably contains 
only the dynamics of the background (eg. swaying branches, 
ocean waves, illumination variations, cast shadows). Incoming 
frames are then compared to the reference model and pixels 
or regions that do not fit the model (ie. outliers) are labelled 
as foreground. Optionally, the reference model is updated with 
areas that are deemed to be the background in the processed 
frames. 

In general, foreground areas are selected in one of two ways: 
(i) pixel-by-pixel, where an independent decision is made for 
each pixel, and (ii) region-based, where a decision is made 
on an entire group of spatially close pixels. Below we briefly 
overview several notable papers in both categories. As an in-depth 
review of existing literature is beyond the scope of this 
paper, we refer the reader to several recent surveys for more 
details [11], [12], [13], [14], [21]. 

^1 This paper is a revised and extended version of our earlier work [20]. 

The vast majority of the algorithms described in the litera- 
ture belong to the pixel-by-pixel category. Notable examples 
include techniques based on modelling the distribution of 
pixel values at each location. For example, Stauffer and 
Grimson [22] model each pixel location by a Gaussian mixture 
model (GMM). Extensions and improvements to this method 
include model update procedures [23], adaptively changing the 
number of Gaussians per pixel [24] and selectively filtering 
out pixels arising due to noise and illumination, prior to 
applying GMM [25]. Some techniques employ non-parametric 
modelling — for instance, Gaussian kernel density estima- 
tion [26] and a Bayes decision rule for classification [27]. 
The latter method models stationary regions of the image by 
colour features and dynamic regions by colour co-occurrence 
features. The features are modelled by histograms. 

Other approaches employ more complex strategies in order 
to improve segmentation quality in the presence of illumi- 
nation variations and dynamic backgrounds. For instance, 
Han and Davis [28] represent each pixel location by colour, 
gradient and Haar-like features. They use kernel density ap- 
proximation to model features and a support vector machine 
for classification. Parag et al. [29] automatically select a subset 
of features at each pixel location using a boosting algorithm. 
Online discriminative learning [30] is also employed for real- 
time background subtraction using a graphics accelerator. To 
address long and short term illumination changes separately, 
[31], [32] maintain two distinct background models for colour 
and texture features. Hierarchical approaches [33], [34] anal- 
yse data from various viewpoints (such as frame, region and 
pixel levels). A related strategy which employs frame-level 
analysis to model background is subspace learning [35], [36]. 
Although these methods process data at various levels, the 
classification is still made at pixel level. 

More recently, Lopez-Rubio et al. [37] maintain a dual 
mixture model at each pixel location for modelling the 
background and foreground distributions, respectively. The 
background pixels are modelled by a Gaussian distribution 
while the foreground pixels are modelled by a uniform 
distribution. The models are updated using a stochastic ap- 
proximation technique. Probabilistic self-organising maps have 
also been examined to model the background [38], [39]. To 
mitigate pixel-level noise, [39] also considers a given pixel's 
8-connected neighbours prior to its classification. 

Notwithstanding the numerous improvements, an inherent 
limitation of pixel-by-pixel processing is that rich contextual 
information is not taken into account. For example, pixel-based 
segmentation algorithms may require ad-hoc post-processing 
(eg. morphological operations [40]) to deal with incorrectly 
classified and scattered pixels in the foreground mask. 

In comparison to the pixel-by-pixel category, relatively 
little research has been done in the region-based category. 
In the latter school of thought, each frame is typically split 
into blocks (or patches) and the classification is made at 
the block-level (ie. effectively taking into account contextual 
information). As adjacently located blocks are typically used, 
a general limitation of region-based methods is that the gener- 
ated foreground masks exhibit 'blockiness' artefacts (ie. rough 
foreground object contours). 

Differences between blocks from a frame and the back- 
ground can be measured by, for example, edge histograms [41] 
and normalised vector distances [42]. Both of the above 
methods handle the problem of varying illumination but do 
not address dynamic backgrounds. In [43], [44], a set of identical 
classifiers is trained for each block of the background using 
online boosting. Blocks yielding a low 
confidence score are treated as foreground. Other techniques 
within this family include exploiting spatial co-occurrences 
of variations (eg. waving trees, illumination changes) across 
neighbouring blocks [45], as well as decomposing a given 
video into spatiotemporal blocks to obtain a joint represen- 
tation of texture and motion patterns [46], [47]. The use of 
temporal analysis in the latter approach aids in building good 
representative models but at an increased computational cost. 

In this paper we propose a robust foreground segmentation 
algorithm that belongs to the region-based category, but is able 
to make the final decisions at the pixel level. Briefly, a given 
image is split into overlapping blocks. Rather than relying on 
a single classifier for each block, an adaptive classifier cascade 
is used for initial labelling. Each stage analyses a given block 
from a unique perspective. The initial labels are then integrated 
at the pixel level. A pixel is probabilistically classified as fore- 
ground/background based on how many blocks containing that 
particular pixel have been classified as foreground/background. 

The performance of foreground segmentation is typically 
evaluated by comparing generated foreground masks with 
the corresponding ground-truth. As foreground segmentation 
can be used in conjunction with tracking algorithms (either 
as an aid or a necessary component [48]), we furthermore 
propose the use of object tracking performance as an additional 
method for assessing the practical usefulness of foreground 
segmentation methods. 

We continue the paper as follows. In Section II the proposed 
algorithm is described in detail. Performance evaluation and 
comparisons with five other algorithms are given in Section III. 
The main findings and possible future directions are sum- 
marised in Section IV. 

II. Proposed Foreground Detection Technique 
The proposed technique has four main components: 

• Division of a given image into overlapping blocks, fol- 
lowed by generating a low-dimensional descriptor for 
each block. 

• Classification of each block into foreground or back- 
ground, where each block is processed by a cascade 
comprised of three classifiers. 

• Model reinitialisation to address scenarios where a sud- 
den and significant scene change can make the current 
background model inaccurate. 

• Probabilistic generation of the foreground mask, where 
the classification decisions for all blocks are integrated 
into final pixel-level foreground segmentation. 



Each of the components is explained in more detail in the 
following sections. 



A. Blocking and Generation of Descriptors 

Each image is split into blocks which are considerably 
smaller than the size of the image (eg. 2x2, 4x4, ..., 16x16), 
with each block overlapping its neighbours by a configurable 
number of pixels (eg. 1, 2, ..., 8) in both the horizontal and 
vertical directions. Block overlapping can also be interpreted 
as block advancement. For instance, maximum overlapping 
between blocks corresponds to block advancements by 1 pixel. 
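
The relation between block advancement and the number of blocks to be processed can be illustrated with a minimal sketch. The frame size of 160x128 pixels and the function name are assumptions made purely for illustration, and only blocks lying entirely inside the image are counted.

```python
def block_origins(width, height, block_size, advance):
    """Top-left corners of all blocks that fit entirely inside the image."""
    xs = range(0, width - block_size + 1, advance)
    ys = range(0, height - block_size + 1, advance)
    return [(x, y) for y in ys for x in xs]


# Hypothetical 160x128 frame analysed with 8x8 blocks.
dense = block_origins(160, 128, 8, 1)   # advancement of 1 pixel (maximum overlap)
sparse = block_origins(160, 128, 8, 8)  # advancement of 8 pixels (no overlap)
print(len(dense), len(sparse))
```

Under these assumptions, maximum overlap requires roughly 58 times as many blocks as the non-overlapping case, which is the source of the additional computational load.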

2D Discrete Cosine Transform (DCT) decomposition is 
employed to obtain a relatively robust and compact description 
of each block [40]. Image noise and minor variations are 
effectively ignored by keeping only several low-order DCT 
coefficients which reflect the average intensity and low frequency 
information [49]. Specifically, for a block located at 
(i, j), four coefficients per colour channel are retained (based 
on preliminary experiments), leading to a 12 dimensional 
descriptor: 

$$ \mathbf{d}_{(i,j)} = \left[ c_0^{[r]}, \ldots, c_3^{[r]}, \; c_0^{[g]}, \ldots, c_3^{[g]}, \; c_0^{[b]}, \ldots, c_3^{[b]} \right]^T \qquad (1) $$

where $c_n^{[k]}$ denotes the n-th DCT coefficient from the k-th 
colour channel, with $k \in \{r, g, b\}$. 
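
A minimal sketch of the descriptor extraction is given below. It assumes an orthonormal 2D DCT-II and assumes that the four retained coefficients are those at positions (0,0), (0,1), (1,0) and (1,1); the text only states that low-order coefficients are kept, so this selection, as well as the function names, are illustrative choices.

```python
import math


def dct2(block):
    """Orthonormal 2D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(u):
        return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for y in range(n):
                for x in range(n):
                    s += (block[y][x]
                          * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out


# Assumption: the four lowest-frequency positions are the ones retained.
KEPT = [(0, 0), (0, 1), (1, 0), (1, 1)]


def block_descriptor(channels):
    """Build the 12-dimensional descriptor.

    channels -- dict with keys 'r', 'g', 'b', each an NxN block of intensities.
    """
    d = []
    for k in ("r", "g", "b"):
        coeffs = dct2(channels[k])
        d.extend(coeffs[u][v] for (u, v) in KEPT)
    return d
```

For a block of constant intensity only the first retained coefficient of each channel is non-zero, which shows how the descriptor summarises average intensity while discarding fine detail.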



B. Classifier Cascade 

Each block's descriptor is analysed sequentially by three 
classifiers, with each classifier using location specific param- 
eters. As soon as one of the classifiers deems that the block 
is part of the background, the remaining classifiers are not 
consulted. 

The first classifier handles dynamic backgrounds (such as 
waving trees, water surfaces and fountains), but fails when 
illumination variations exist. The second classifier analyses 
if the anomalies in the descriptor are due to illumination 
variations. The third classifier exploits temporal correlations 
(that naturally exist in image sequences) to partially handle 
changes in environmental conditions and minimise spurious 
false positives. The three classifiers are elucidated below. 

1) Probability measurement: The first classifier employs 
a multivariate Gaussian model for each of the background 
blocks. The likelihood of descriptor $\mathbf{d}_{(i,j)}$ belonging to the 
background class is found via: 

$$ p(\mathbf{d}_{(i,j)}) = \frac{\exp\left\{ -\frac{1}{2} \left[ \mathbf{d}_{(i,j)} - \boldsymbol{\mu}_{(i,j)} \right]^T \boldsymbol{\Sigma}_{(i,j)}^{-1} \left[ \mathbf{d}_{(i,j)} - \boldsymbol{\mu}_{(i,j)} \right] \right\}}{(2\pi)^{D/2} \left| \boldsymbol{\Sigma}_{(i,j)} \right|^{1/2}} \qquad (2) $$

where $\boldsymbol{\mu}_{(i,j)}$ and $\boldsymbol{\Sigma}_{(i,j)}$ are the mean vector and covariance 
matrix for location (i, j), respectively, while D is the dimensionality 
of the descriptors. For ease of implementation and 
reduced computational load, the dimensions are assumed to 
be independent and hence the covariance matrix is diagonal. 

To obtain $\boldsymbol{\mu}_{(i,j)}$ and $\boldsymbol{\Sigma}_{(i,j)}$, the first few seconds of the 
sequence are used for training. To allow the training sequence 
to contain moving foreground objects, a robust estimation 
strategy is employed instead of directly obtaining 
the parameters. Specifically, for each block location a two- 
component Gaussian mixture model is trained, followed by 
taking the absolute difference of the weights of the two 
Gaussians. If the difference is greater than 0.5 (based on 
preliminary experiments), we retain the Gaussian with the 
dominant weight. The reasoning is that the less prominent 
Gaussian is modelling moving foreground objects and/or other 
outliers. If the difference is less than 0.5, we assume that no 
foreground objects are present and use all available data for 
that particular block location to estimate the parameters of the 
single Gaussian. More involved approaches for dealing with 
foreground clutter during training are given in [50], [51]. 
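
The selection rule described above can be sketched as follows. The sketch assumes that a two-component mixture has already been fitted for the block location by some external routine; the function and argument names are hypothetical.

```python
def select_background_gaussian(weights, components, pooled, min_gap=0.5):
    """Choose the background model for one block location.

    weights    -- the two mixture weights of the fitted two-component GMM
    components -- parameters of the two components, in the same order
    pooled     -- parameters of a single Gaussian estimated from all training data
    min_gap    -- weight difference above which the dominant component is kept
    """
    w0, w1 = weights
    if abs(w0 - w1) > min_gap:
        # The weaker component is assumed to model foreground objects or outliers.
        return components[0] if w0 > w1 else components[1]
    # No clear dominance: assume no foreground was present during training.
    return pooled
```

Since the two weights sum to one, a difference above 0.5 means that the dominant Gaussian accounts for more than 75% of the training descriptors.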

If $p(\mathbf{d}_{(i,j)}) > T_{(i,j)}$, the corresponding block is classified 
as background. The value of $T_{(i,j)}$ is equal to $p(\mathbf{t}_{(i,j)})$, where 
$\mathbf{t}_{(i,j)} = \boldsymbol{\mu}_{(i,j)} + 2 \, \mathrm{diag}(\boldsymbol{\Sigma}_{(i,j)})^{1/2}$. Here the square root operation 
is applied element-wise. Under the diagonal covariance matrix 
constraint, this threshold covers about 95% of the distribution [52]. 
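
A minimal sketch of the first classifier under the diagonal covariance assumption is given below. The comparison is carried out in the log domain, which is equivalent because the logarithm is monotonic; the function names are illustrative. Since the threshold point lies two standard deviations from the mean in every dimension, the test is equivalent to requiring a squared Mahalanobis distance below 4D.

```python
import math


def log_likelihood(d, mean, var):
    """Log of the Gaussian density with diagonal covariance (variances in var)."""
    dims = len(d)
    maha = sum((x - m) ** 2 / v for x, m, v in zip(d, mean, var))
    log_det = sum(math.log(v) for v in var)
    return -0.5 * (maha + log_det + dims * math.log(2.0 * math.pi))


def log_threshold(mean, var):
    """Log-likelihood of the point two standard deviations away in each dimension."""
    t = [m + 2.0 * math.sqrt(v) for m, v in zip(mean, var)]
    return log_likelihood(t, mean, var)


def is_background_stage1(d, mean, var):
    """True if the descriptor is more likely than the threshold point."""
    return log_likelihood(d, mean, var) > log_threshold(mean, var)
```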

If a block has been classified as background, the correspond- 
ing Gaussian model is updated using the adaptation technique 
similar to Wren et al. [53]. Specifically, the mean and diagonal 
covariance vectors are updated as follows: 
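
A minimal sketch of such an adaptation step is given below. It assumes an exponential running average in the spirit of Wren et al. [53]; both the exact form and the learning rate rho (including its default value) are assumptions made for illustration rather than the settings used in this work.

```python
def update_background_model(d, mean, var, rho=0.05):
    """Blend the current descriptor into the mean and diagonal covariance.

    rho is a hypothetical learning rate in (0, 1); larger values adapt faster.
    """
    new_mean = [(1.0 - rho) * m + rho * x for m, x in zip(mean, d)]
    new_var = [(1.0 - rho) * v + rho * (x - nm) ** 2
               for v, x, nm in zip(var, d, new_mean)]
    return new_mean, new_var
```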
2) Cosine distance: The second classifier employs a distance 
metric based on the cosine of the angle subtended between 
two vectors. Empirical observations suggest the angles 
subtended by descriptors obtained from a block exposed to 
varying illumination are almost the same. A similar phe- 
nomenon was also observed in RGB colour space [54]. 

If block (i, j) has not been classified as part of the 
background by the previous classifier, the cosine distance is 
computed using: 

$$ \mathrm{cosdist}(\mathbf{d}_{(i,j)}, \boldsymbol{\mu}_{(i,j)}) = 1 - \frac{\mathbf{d}_{(i,j)}^T \, \boldsymbol{\mu}_{(i,j)}}{\left\| \mathbf{d}_{(i,j)} \right\| \, \left\| \boldsymbol{\mu}_{(i,j)} \right\|} \qquad (5) $$

where $\boldsymbol{\mu}_{(i,j)}$ is from Eqn. (2). If $\mathrm{cosdist}(\mathbf{d}_{(i,j)}, \boldsymbol{\mu}_{(i,j)}) < C_1$, 
block (i, j) is deemed as background. The value of $C_1$ is 
set to a low value such that it results in slightly more false 
positives than false negatives. This ensures a low probability 
of misclassifying foreground objects as background. However, 
the surplus false positives are eliminated during the creation 
of the foreground mask (Section II-D). Based on preliminary 
results, the constant $C_1$ is set to 0.1% of the maximum value 
(for a cosine distance metric the maximum value is unity). 
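
The insensitivity of this measure to illumination scaling can be illustrated with a short sketch. It assumes the common definition of cosine distance as one minus the cosine of the angle between the vectors; the example descriptors are made up for illustration and zero-length vectors are not handled.

```python
import math


def cosdist(a, b):
    """One minus the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


C1 = 0.001  # 0.1% of the maximum value (unity)

mean = [40.0, 3.0, -2.0, 1.0]           # hypothetical background mean
brighter = [1.5 * m for m in mean]      # same texture under a uniform gain
different = [40.0, -30.0, 25.0, 1.0]    # different texture

print(cosdist(brighter, mean) < C1)     # scaled descriptor is accepted
print(cosdist(different, mean) < C1)    # changed texture is rejected
```

A uniform gain changes the length of the descriptor but not its direction, so the distance stays near zero even though the Gaussian likelihood in the first classifier would drop sharply.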

3) Temporal correlation check: For each block, the third 
classifier takes into account the current descriptor as well as 
the corresponding descriptor from the previous image, denoted 
as $\mathbf{d}^{[prev]}_{(i,j)}$. Block (i, j) is labelled as part of the background if 
the following two conditions are satisfied: 

1) $\mathbf{d}^{[prev]}_{(i,j)}$ was classified as background; 

2) $\mathrm{cosdist}(\mathbf{d}^{[prev]}_{(i,j)}, \mathbf{d}_{(i,j)}) < C_2$. 

Condition 1 ensures the cosine distance measured in Condition 2 
is not with respect to a descriptor classified as 
foreground. As the sample points are consecutive in time and 
should be almost identical if $\mathbf{d}_{(i,j)}$ belongs to background, we 
use $C_2 = 0.5 \times C_1$. 
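
The overall cascade, with its early exit as soon as one stage deems the block to be background, can be sketched as follows. The sketch assumes a diagonal covariance (so that the first stage reduces to a squared Mahalanobis distance below 4D) and cosine distance defined as one minus the cosine of the angle; function names and the example vectors are illustrative.

```python
import math

C1 = 0.001      # 0.1% of the maximum cosine distance
C2 = 0.5 * C1   # tighter threshold for consecutive frames


def cosdist(a, b):
    """One minus the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def passes_gaussian(d, mean, var):
    """Stage 1: p(d) > p(mean + 2*sqrt(var)) for a diagonal covariance."""
    maha = sum((x - m) ** 2 / v for x, m, v in zip(d, mean, var))
    return maha < 4.0 * len(d)


def classify_block(d, mean, var, prev_d=None, prev_was_background=False):
    """Return 'bg' as soon as any stage accepts the block, otherwise 'fg'."""
    if passes_gaussian(d, mean, var):
        return "bg"
    if cosdist(d, mean) < C1:
        return "bg"
    if prev_was_background and prev_d is not None and cosdist(prev_d, d) < C2:
        return "bg"
    return "fg"
```

Later stages are evaluated only for blocks rejected by the earlier ones, so the more expensive checks are skipped for the bulk of the background.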

C. Model Reinitialisation 

A scene change might be too quick and/or too severe for the 
adaptation and classification strategies used above (eg. severe 
illumination change due to lights being switched on in a dark 
room). As such, the existing background model can wrongly 
detect a very large portion of the image as foreground. 

Model reinitialisation is triggered if a 'significant' portion 
of each image is consistently classified as foreground for a 
reasonable period of time. The criterion for defining a 
'significant' portion depends on parameters such as scene 
dynamics and the size of foreground objects. Based on preliminary 
evaluations, a threshold value of 70% appears to work 
reasonably well. In order to ensure the model quickly adapts 
to the new environment, reinitialisation is invoked as soon as 
this phenomenon is consistently observed for a time period 
of at least ½ second (ie. about 15 frames when sequences are 
captured at 30 fps). The corresponding images are accumulated 
and are used to rebuild the statistics of the new scene. Due to 
the small amount of retraining data, the covariance matrices 
are kept as is, while the new means are obtained as per the 
estimation method described in Section II-B1. 
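
The trigger condition can be sketched as a simple run-length counter. The sketch interprets 'consistently observed' as an uninterrupted run of consecutive frames, which is an assumption; the class name and the way the foreground fraction is supplied are likewise illustrative.

```python
class ReinitialisationMonitor:
    """Signals reinitialisation after a sustained run of mostly-foreground frames."""

    def __init__(self, fg_fraction_threshold=0.70, min_frames=15):
        self.fg_fraction_threshold = fg_fraction_threshold
        self.min_frames = min_frames  # about half a second at 30 fps
        self.run_length = 0

    def update(self, fg_fraction):
        """Feed the foreground fraction of one frame; return True to reinitialise."""
        if fg_fraction > self.fg_fraction_threshold:
            self.run_length += 1
        else:
            self.run_length = 0  # the run was interrupted
        if self.run_length >= self.min_frames:
            self.run_length = 0
            return True
        return False
```

In this sketch, the frames seen during the run would be accumulated by the caller and used to re-estimate the means once update() returns True.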

D. Probabilistic Foreground Mask Generation 

In typical block based classification methods, misclassifi- 
cation is inevitable whenever a given block has foreground 
and background pixels (examples are illustrated in Fig. 1). 
We exploit the overlapping nature of the block-based analysis 
to alleviate this inherent problem. Each pixel is classified 
as foreground only if a significant proportion of the blocks 
that contain that pixel are classified as foreground. In other 
words, a pixel that was misclassified a few times prior to 
mask generation can be classified correctly in the generated 
foreground mask. This decision strategy, similar to majority 
voting, effectively minimises the number of errors in the 
output. This approach is in contrast to conventional methods, 
such as those based on Gaussian mixture models [23], kernel 
density estimation [26] and codebook models [54], which do 
not have this built-in 'self-correcting' mechanism. 

Formally, let the pixel located at (x, y) in image I be denoted 
as $I_{(x,y)}$. Furthermore, let $B^{fg}_{(x,y)}$ be the number of blocks 
containing pixel (x, y) that were classified as foreground (fg), 
and $B^{total}_{(x,y)}$ be the total number of blocks containing pixel (x, y). 
We define the probability of foreground being present in $I_{(x,y)}$ 
as: 

$$ P(\mathrm{fg} \mid I_{(x,y)}) = B^{fg}_{(x,y)} \, / \, B^{total}_{(x,y)} \qquad (6) $$

If $P(\mathrm{fg} \mid I_{(x,y)}) > 0.90$ (based on preliminary analysis), pixel 
$I_{(x,y)}$ is labelled as part of the foreground. 
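
The integration of block-level decisions into a pixel-level mask can be sketched as follows. The block labels are supplied through a callable, which is an illustrative interface rather than part of the described method, and only blocks lying entirely inside the image are considered.

```python
def foreground_mask(width, height, block_size, advance, block_is_fg, threshold=0.90):
    """Per-pixel foreground mask from overlapping block decisions.

    block_is_fg -- callable (bx, by) -> bool giving the label of the block
                   whose top-left corner is at (bx, by)
    """
    fg_count = [[0] * width for _ in range(height)]
    total = [[0] * width for _ in range(height)]
    for by in range(0, height - block_size + 1, advance):
        for bx in range(0, width - block_size + 1, advance):
            fg = block_is_fg(bx, by)
            for y in range(by, by + block_size):
                for x in range(bx, bx + block_size):
                    total[y][x] += 1
                    if fg:
                        fg_count[y][x] += 1
    # A pixel is foreground if the fraction of foreground blocks covering it
    # exceeds the threshold; pixels covered by no block stay background.
    return [[total[y][x] > 0 and fg_count[y][x] / total[y][x] > threshold
             for x in range(width)]
            for y in range(height)]


# Toy example: a 6x2 image, 2x2 blocks advanced by 1 pixel,
# with blocks starting at column 2 or later labelled as foreground.
mask = foreground_mask(6, 2, 2, 1, lambda bx, by: bx >= 2)
print(mask[0])
```

In the toy example the pixel in column 2 is covered by one background and one foreground block, so its foreground probability is 0.5 and it remains background, while columns 3 to 5 are covered only by foreground blocks.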




Block A Block B 

Fig. 1. Without taking into account block overlapping, misclassification 
is inevitable at the pixel level whenever a given block has both foreground 
(FG) and background (BG) pixels. Classifying Block A as background results 
in a few false negatives (foreground pixels classified as background) while 
classifying Block B as foreground results in a few false positives (background 
pixels classified as foreground). 

III. Experiments 

In this section, we first provide a brief description of the 
datasets used in our experiments in Section III-A. We then 
evaluate the effect of two key parameters (block size and 
block advancement) and the contribution of the three classifier 
stages to overall performance in Sections III-B and III-C, 
respectively. 

For comparative evaluation, we conducted two sets of exper- 
iments: (i) subjective and objective evaluation of foreground 
segmentation efficacy, using datasets with available ground- 
truths; (ii) comparison of the effect of the various foreground 
segmentation methods on tracking performance. The details 
of the experiments are described in Sections III-D and III-E, 
respectively. 

The proposed algorithm^2 was implemented in C++ with 
the aid of Armadillo [56] and OpenCV libraries [57]. All 
experiments were conducted on a standard 3 GHz machine. 

A. Datasets 

We use three datasets for the experiments: I2R^3, 
Wallflower^4, and CAVIAR^5. The I2R dataset has nine sequences 
captured in diverse and challenging environments 
characterised by complex backgrounds such as waving trees, 
fountains, and escalators. Furthermore, the dataset also ex- 
hibits the phenomena of illumination variations and cast 
shadows. For each sequence there are 20 randomly selected 
images for which the ground-truth foreground masks are 
available. The Wallflower dataset has seven sequences, with 
each sequence being a representative of a distinct problem 
encountered in background modelling [55]. The background 
is subjected to various phenomena which include sudden and 
gradual lighting changes, dynamic motion, camouflage, fore- 
ground aperture, bootstrapping and movement of background 
objects within the scene. Each sequence has only one ground- 
truth foreground mask. The second subset of CAVIAR, used 
for the tracking experiments, has 52 sequences with tracking 
ground truth data (ie. object positions). Example images from 
the three datasets are given in Figures 8, 9 and 10. 

^2 Source code for the proposed algorithm is available from 
http://arma.sourceforge.net/foreground/ 
^3 http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html 
^4 http://research.microsoft.com/en-us/um/people/jckrumm/WallFlower/TestImages.htm 
^5 http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/ 



TABLE I 

Accuracy of foreground estimation for various block sizes on 
the I2R and Wallflower datasets, with the block advancement 
fixed at 1 (ie. maximum overlap). Accuracy was measured by 
F-measure averaged over all frames where ground-truth is 
available. The 'mean' column indicates the mean of the values 
obtained for the two datasets. 

                    Average F-measure 
Block Size     I2R       Wallflower     mean 
2x2            0.726     0.588          0.657 
4x4            0.791     0.633          0.712 
6x6            0.790     0.714          0.752 
8x8            0.780     0.733          0.756 
10x10          0.760     0.735          0.735 
12x12          0.732     0.729          0.731 
14x14          0.704     0.715          0.710 
16x16          0.659     0.692          0.675 


B. Effects of Block Size and Advancement (Overlapping) 

In this section we evaluate the effect of block size and 
block advancement on the overall performance. For quantitative 
evaluation we adopted the F-measure metric used by 
Brutzer et al. [12], which quantifies how similar the obtained 
foreground mask is to the ground-truth: 

$$ \text{F-measure} = 2 \, \frac{\text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}} \qquad (7) $$

where $\text{F-measure} \in [0, 1]$, while precision and recall are 
given by $\frac{tp}{tp + fp}$ and $\frac{tp}{tp + fn}$, respectively. The notations tp, 
fp and fn are the total number of true positives, false positives 
and false negatives (in terms of pixels), respectively. The 
higher the F-measure value, the more accurate the foreground 
segmentation. 
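
The metric can be computed directly from pixel counts, as in the minimal sketch below. Returning zero when there are no true positives is a convention assumed here to avoid division by zero; the function name is illustrative.

```python
def f_measure(tp, fp, fn):
    """F-measure from pixel counts of true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0  # assumed convention when precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * recall * precision / (recall + precision)
```

For example, a mask with 8 true positive, 2 false positive and 2 false negative pixels has precision and recall both equal to 0.8, giving an F-measure of 0.8.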

Table I shows the performance of the proposed algorithm 
for block sizes ranging from 2x2 to 16x16, with the block 
advancement fixed at 1 (ie. maximum overlap between blocks). 
The optimal block size for the I2R dataset is 4x4, with the 
performance being quite stable from 4x4 to 8x8. For the 
Wallflower dataset the optimal size is 10x10, with similar 
performance obtained using 8x8 to 12x12. By taking the 
mean of the values obtained for each block size across both 
datasets, the overall optimal size appears to be 8x8. This block 
size is used in all following experiments. 

Figures 2 and 3 show the effect of block advancement 
on foreground segmentation accuracy and processing speed 
on the I2R dataset. As the block size is fixed to 8x8, 
block advancement of 8 pixels (between successive blocks) 
indicates no overlapping, while block advancement of 1 pixel 
denotes maximum overlap. The smaller the block advancement 
(ie. higher overlap), the higher the accuracy and smoother 
object contours, at the expense of a considerable increase in 
the computational load (due to more blocks that need to be 
processed). A block advancement of 1 pixel achieves the best 
F-measure value of 0.78, at the cost of low processing speed 
(10 frames per second). Increasing the block advancement to 
2 pixels somewhat decreases the F-measure value to 0.76, 
but the processing speed rises to 40 frames per second. 



C. Contribution of Individual Classifier Stages 

In the proposed algorithm, each classifier (see Section II-B) 
handles a distinct problem such as dynamic backgrounds 
and varying illuminations. In this section, the influence of 
individual classifiers to the overall segmentation performance 
is further investigated. We evaluate the segmentation quality 
using three separate configurations: (i) classification using the 
first classifier (based on multivariate Gaussian density func- 
tion) alone, (ii) classification using a combination of the first 
classifier followed by the second (based on cosine distance), 
(iii) classification using all stages. The qualitative results of 
each configuration using the I2R dataset are shown in Fig. 4. 
The quantitative results of each configuration for various block 
advancements are shown in Fig. 5. 

We note that the best segmentation results are obtained 
for the default configuration when all 3 classifiers are used. 
The next best configuration is the combination of the first 
and second classifiers which independently inspect for scene 
changes occurring due to dynamic backgrounds and illumi- 
nation variations, respectively. The configuration comprising 
only the first classifier yields the lowest F-measure value, 
since background variations due to illumination are not han- 
dled effectively by it. 

We note the impact of the third classifier appears to be minor 
compared to that of the second, since it aims to minimise 
the occasional false positives by examining the temporal correlations 
between consecutive frames (see Section II-B3). The 
relative improvement in average F-measure value achieved 
by adding the second classifier is about 37%, while adding 
the third gives further relative improvement of about 5%. 
Qualitative results of each configuration shown in Figure 4 
confirm the above observations. 

D. Comparative Evaluation by Ground-Truth F-measure 

The proposed algorithm is compared with segmentation 
methods based on Gaussian mixture models (GMMs) [23], 
feature histograms [27], probabilistic self organising 
maps (SOM) [39], stochastic approximation (SA) [37] 
and normalised vector distances (NVD) [42]. The first 
four methods classify individual pixels into foreground or 
background, while the last method makes decisions on groups 
of pixels. 

We used the OpenCV v2.0 [57] implementations for the 
GMM and feature histogram based methods with default 
parameters, except for setting the learning parameter in GMM 
to 0.001. Experiments showed that the above parameter settings 
produce optimal segmentation performance. We used the 
implementations made available by the authors for the SOM^6 and 
SA^7 methods. 

Post-processing using morphological operations was required 
for the foreground masks obtained by the GMM, 
feature histogram and SOM methods, in order to clean up the 
scattered error pixels. For the GMM method, opening followed 
by closing using a 3x3 kernel was performed, while for 
the feature histogram method we enabled the built-in post-processor 
(using default parameters suggested in the OpenCV 
implementation). We note that the proposed method does not 
require any such ad-hoc post-processing. 

^6 http://www.lcc.uma.es/~ezeqlr/fsom/fsom.html 
^7 http://www.lcc.uma.es/~ezeqlr/backsa/backsa.html 

Fig. 2. (a) An example frame from the I2R dataset, (b) its corresponding ground-truth foreground mask. Using the proposed method with a block size of 
8x8, the foreground masks obtained for various degrees of block advancement: (c) 1 pixel, (d) 2 pixels, (e) 4 pixels, and (f) 8 pixels (ie. no overlap). 

Fig. 3. Effect of block advancements on: (a) F-measure value and (b) processing speed in terms of frames per second obtained using the I2R dataset. 
A considerable gain in processing speed is achieved as the advancement between blocks increases, at the expense of a gradual decrease in F-measure values. 

Fig. 4. (a) Example frames from I2R dataset. (b) Ground truth foreground masks. Foreground masks obtained by the proposed method using: (c) the 
first classifier only, (d) combination of the first and second classifiers, (e) using all three classifiers. Adding the second classifier considerably improves the 
segmentation quality, while the addition of the third classifier aids in minor reduction of false positives. See Fig. 5 for quantitative results. 

Fig. 5. Impact of individual classifiers on the overall segmentation quality for various block advancement values, using the I2R dataset. The best results 
are achieved when all 3 classifiers are used (default configuration). 

Fig. 6. Comparison of F-measure values (defined in Eqn. (7)) obtained on the I2R dataset using foreground segmentation methods based on GMMs [23], 
feature histograms [27], NVD [42], SOM [39], SA [37] and the proposed method. The higher the F-measure (ie. agreement with ground-truth), the better 
the segmentation result. 

Fig. 7. As per Fig. 6, but obtained on the Wallflower dataset. Due to the absence of true positives in the ground-truth for the moved object sequence, the 
corresponding F-measure is zero for all algorithms. 

With the view of designing a pragmatic system, the same 
parameter settings were used across all sequences (ie. they 
were not optimised for any particular sequence). Specifically, 
during deployment a practical system has to perform robustly 
in many scenarios. 

We present both qualitative and quantitative analysis of 
the results. Figs. 6 and 7 show quantitative results for the 
I2R and Wallflower datasets, respectively. The corresponding 
qualitative results for three sequences from each dataset are 
shown in Figs. 8 and 9. 

In Fig. 8, the AP sequence (left column) has significant 
cast shadows of people moving at an airport. The FT sequence 
(middle column) contains people moving against a background 
of a fountain with varying illumination. The MR sequence 
(right column) shows a person entering and leaving a room 
where the window blinds are non-stationary and there are 
significant illumination variations caused by the automatic gain 
control of the camera. 

Fig. 8. (a) Example frames from three video sequences in the I2R 
dataset. Left: people walking at an airport, with significant cast shadows. 
Middle: people moving against a background of a fountain with varying 
illumination. Right: a person walks in and out of a room where the window 
blinds are non-stationary, with illumination variations caused by automatic 
gain control of the camera. (b) Ground-truth foreground mask, and foreground 
mask estimation using: (c) GMM based [23] with morphological post- 
processing, (d) feature histograms [27], (e) NVD [42], (f) SOM [39], 
(g) SA [37], (h) proposed method. 

Fig. 9. As per Fig. 8, but using the Wallflower dataset. Left: room 
illumination gradually increases over time and a person walks in and sits on 
the couch. Middle: person walking against a background of strongly waving 
trees and the sky. Right: a monitor displaying a blue screen with rolling bars 
is occluded by a person wearing blue coloured clothing. 

In Fig. 9, the time of day sequence (left column) has a 
gradual increase in the room's illumination intensity over time. 
A person walks in and sits on the couch. The waving trees 
sequence (middle column) has a person walking against a 
background consisting of the sky and strongly waving trees. In 
the camouflage sequence (right column), a monitor has a blue 
screen with rolling bars. A person in blue coloured clothing 
walks in and occludes the monitor. 

We note that the output of the GMM based method (row c 
in Figs. 8 and 9) is sensitive to reflections, illumination 
changes and cast shadows. While the histogram based method 
(row d) overcomes these limitations, it produces many false 
negatives. The NVD based method (row e) is largely 
robust to illumination changes, but fails to handle dynamic 
backgrounds and produces 'blocky' foreground masks. The 
SOM and SA based methods (rows f and g) have relatively few 
false positives and negatives. The results obtained by the proposed method 
(row h) are qualitatively better than those obtained by 
the other five methods, having few false positives and false 
negatives. However, we note that due to the block-based 
nature of the analysis, objects very close to each other tend to 
merge. 

The quantitative results (using the F-measure metric) ob- 
tained on the I2R and Wallflower datasets, shown in Figs. 6 
and 7, respectively, largely confirm the visual results. On the 
I2R dataset the proposed method outperforms the other meth- 
ods in most cases. The next best method (SOM) obtained an 
average F-measure value of 0.72, while the proposed method 
achieved 0.78, representing an improvement of about 8%. 

On the Wallflower dataset the proposed method achieved 
considerably better results for the foreground aperture se- 
quence. While for the remainder of the sequences the per- 
formance was roughly on par with the other methods, the 
proposed method nevertheless still achieved the highest av- 
erage F-measure value. The next best method (histogram 
of features) obtained an average value of 0.66, while the 
proposed method obtained 0.73, representing an improvement 
of about 11%. 
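
As an illustration of how the figures quoted above relate to each other, the following minimal Python sketch computes an F-measure from a pair of binary masks, as well as the relative improvement between two average F-measure values. It assumes the common harmonic-mean form of precision and recall for Eqn. (7), which is not reproduced in this section; the function names and the nested-list mask representation are hypothetical and chosen only for illustration.

```python
def f_measure(estimated, ground_truth):
    """F-measure between two binary masks given as nested lists of 0/1.

    Assumption: F = 2 * precision * recall / (precision + recall).
    """
    tp = fp = fn = 0
    for est_row, gt_row in zip(estimated, ground_truth):
        for est, gt in zip(est_row, gt_row):
            if est and gt:
                tp += 1          # foreground pixel correctly detected
            elif est and not gt:
                fp += 1          # background pixel labelled as foreground
            elif gt and not est:
                fn += 1          # foreground pixel missed
    if tp == 0:
        # No true positives (e.g. an empty ground-truth mask):
        # the F-measure is taken to be zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


def relative_improvement(baseline, proposed):
    """Relative improvement of `proposed` over `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline
```

With the average values reported above, relative_improvement(0.72, 0.78) evaluates to roughly 8% (I2R) and relative_improvement(0.66, 0.73) to roughly 11% (Wallflower).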

We note that the performance of the proposed method on 
the Bootstrapping sequence is lower. We conjecture that this is due 
to foreground objects occluding the background during the training 
phase. Robust background initialisation techniques [50], 
[51] capable of estimating the background in cluttered se- 
quences could be used to alleviate this problem. 



E. Comparative Evaluation by Tracking Precision & Accuracy 

We conducted a second set of experiments to evaluate the 
performance of the segmentation methods in more pragmatic 
terms rather than limiting ourselves to the traditional ground- 
truth evaluation approach. To this effect, we evaluated the 
influence of the various foreground detection algorithms on 
tracking performance. The foreground masks obtained from 
the detectors for each frame of the sequence were passed as 
input to an object tracking system. We have used a particle 
filter based tracker^ as implemented in the video surveillance 
module of OpenCV v2.0 [57]. Here the foreground masks are 
used prior to tracking for initialisation purposes. 

Fig. 10. Example frames from the second subset of the CAVIAR dataset, 
used for evaluating the influence of various foreground detection algorithms 
on tracking performance. 

TABLE II 

Effect of various settings of block advancement on multiple 
object tracking accuracy (MOTA) in terms of percentage, and 
multiple object tracking precision (MOTP) in terms of pixels. 
Results are obtained on the second subset of CAVIAR by using a 
particle-filter based tracking algorithm. 

                       Tracking Metrics 
Block Advancement      MOTA                  MOTP 
                       (higher is better)    (lower is better) 
1                      30.3                  11.7 
2                      20.4                  11.8 
4                      8.6                   12.4 
8                      -67.3                 14.7 

Tracking performance was measured with the two metrics 
proposed by Bernardin and Stiefelhagen [58], namely multiple 
object tracking precision (MOTP) and multiple object tracking 
accuracy (MOTA). 

Briefly, MOTP measures the average pixel distance between 
ground-truth locations of objects and their locations according 
to a tracking algorithm. Ground truth objects and hypotheses 
are matched using Munkres' algorithm [59]. MOTP is defined 
as: 

$$ \mathrm{MOTP} = \frac{\sum_{i,t} d_t^i}{\sum_t c_t}, \qquad (8) $$

where $d_t^i$ is the distance between object $i$ and its corresponding 
hypothesis, while $c_t$ is the number of matches found at time $t$. 
The lower the MOTP, the better. 
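
The MOTP computation can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in the experiments: it assumes that the matching between ground-truth objects and hypotheses has already been performed for each frame, that object locations are 2D pixel coordinates, and that the distance is Euclidean. The function name and input layout are hypothetical.

```python
import math


def motp(frames):
    """Average distance over all matched (ground-truth, hypothesis) pairs.

    `frames` is a list with one entry per time step t; each entry is a list
    of (ground_truth_xy, hypothesis_xy) pairs that were matched at time t.
    """
    total_distance = 0.0   # numerator: sum of distances over all i and t
    total_matches = 0      # denominator: total number of matches over all t
    for matches in frames:
        for gt_xy, hyp_xy in matches:
            total_distance += math.dist(gt_xy, hyp_xy)
            total_matches += 1
    if total_matches == 0:
        raise ValueError("MOTP is undefined when no matches were found")
    return total_distance / total_matches
```

For example, two frames with matched distances of 5 pixels in the first frame, and 6 and 0 pixels in the second, give an MOTP of 11/3 pixels.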

MOTA accounts for object configuration errors, false posi- 
tives, misses as well as mismatches. It is defined as: 

$$ \mathrm{MOTA} = 1 - \frac{\sum_t \left( m_t + fp_t + mme_t \right)}{\sum_t g_t}. \qquad (9) $$

It measures accuracy in terms of the number of false 
negatives (m), false positives (fp) and mismatch errors (mme) 
with respect to the number of ground truth objects (g). The 
higher the value, the better the accuracy. The MOTA value 
can become negative in certain circumstances when the false 
negatives, false positives and mismatch errors are considerably 
large, making the ratio in Eqn. (9) greater than unity [58]. 
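
A minimal Python sketch of Eqn. (9) is given below, mainly to show how a negative MOTA can arise. It assumes that the per-frame counts of misses, false positives, mismatches and ground-truth objects are already available; the function name and dictionary keys are hypothetical.

```python
def mota(frames):
    """Multiple object tracking accuracy as a fraction (multiply by 100 for %).

    `frames` is a list of per-frame dicts with the integer counts
    'misses', 'false_positives', 'mismatches' and 'ground_truth'.
    """
    errors = sum(
        f["misses"] + f["false_positives"] + f["mismatches"] for f in frames
    )
    objects = sum(f["ground_truth"] for f in frames)
    if objects == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - errors / objects
```

For instance, a single frame with 3 ground-truth objects, 2 misses and 5 false positives yields 1 - 7/3, i.e. a negative value, since the total number of errors exceeds the number of ground-truth objects.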

^ Additional simulations with other tracking algorithms, such as blob 
matching, mean shift and mean shift with foreground feedback, yielded similar 
results. 



TABLE III 

As per Table II, but obtained by employing various foreground 
detection methods. 

                              Tracking Metrics 
Foreground detection          MOTA                  MOTP 
                              (higher is better)    (lower is better) 
GMM based method [23]         27.2                  13.6 
NVD based method [42]         -24.9                 15.2 
Histogram of features [27]    13.7                  14.7 
SOM [39]                      26                    13.3 
SA [37]                       27.3                  13.0 
Proposed method               30.3                  11.7 



The reported results are averaged over the 
52 test sequences belonging to the second subset of CAVIAR. 
To keep the evaluations realistic, the first 200 frames 
of each sequence are used to train the background 
model irrespective of the presence of foreground objects 
(i.e. background frames were not handpicked for training). 

We first evaluated the tracking performance for various 
block advancements. Results presented in Table II indicate 
that a block advancement of 1 pixel obtains the best tracking 
performance, while larger advancements lead to a decrease in 
performance. 

Comparisons with the GMM, NVD, histogram of features, SOM and SA 
based methods, presented in Table III, indicate that the proposed method 
leads to considerably better tracking performance. For tracking 
accuracy (MOTA), the next best method (SA) led to an average 
accuracy of 27.3%, while the proposed method led to 30.3%. 
For tracking precision (MOTP), the next best method (SA) led 
to an average pixel distance of 13.0, while the proposed method 
reduced the distance to 11.7. 

IV. Main Findings 

Pixel-based processing approaches to foreground detection 
can be susceptible to noise, illumination variations and dy- 
namic backgrounds, partly due to not taking into account rich 
contextual information. In contrast, region-based approaches 
mitigate the effect of the above phenomena but suffer from 'block- 
iness' artefacts. The proposed foreground detection method 
belongs to the region-based category, but at the same time is able 
to segment smooth contours of foreground objects. 

Contextual spatial information is employed through 
analysing each frame on an overlapping block-by-block basis. 
The low-dimensional texture descriptor for each block allevi- 
ates the effect of image noise. The model initialisation strategy 
allows the training sequence to contain moving foreground 
objects. The adaptive classifier cascade analyses the descriptor 
from various perspectives before classifying the corresponding 
block as foreground. Specifically, it checks if disparities are 
due to background motion or illumination variations, followed 
by a temporal correlation check to minimise the occasional 
false positives arising from background characteristics 
which were not handled by the preceding classifiers. 

The probabilistic foreground mask generation approach inte- 
grates the block-level classification decisions by exploiting the 
overlapping nature of the analysis, ensuring smooth contours 
of the foreground objects as well as effectively minimising the 
number of errors. Unlike many pixel-based methods, ad-hoc 
post-processing of foreground masks is not required. 
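
The idea of integrating overlapping block-level decisions into a pixel-level mask can be sketched as follows. This is only an illustrative Python sketch under stated assumptions, not the exact formulation used by the proposed method: each pixel is labelled as foreground when the fraction of blocks containing it that were classified as foreground reaches a threshold. The function name, the input layout and the threshold value of 0.9 are assumptions made for the example.

```python
def foreground_mask(block_labels, height, width, block_size, threshold=0.9):
    """Combine overlapping block decisions into a pixel-level binary mask.

    `block_labels` maps the (top, left) corner of each block to True
    (block classified as foreground) or False (background).
    The threshold is an assumed, illustrative value.
    """
    fg_count = [[0] * width for _ in range(height)]   # foreground blocks per pixel
    total = [[0] * width for _ in range(height)]      # all blocks per pixel
    for (top, left), is_foreground in block_labels.items():
        for y in range(top, min(top + block_size, height)):
            for x in range(left, min(left + block_size, width)):
                total[y][x] += 1
                if is_foreground:
                    fg_count[y][x] += 1
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            covered = total[y][x] > 0
            ratio = fg_count[y][x] / total[y][x] if covered else 0.0
            row.append(1 if covered and ratio >= threshold else 0)
        mask.append(row)
    return mask
```

Because each pixel away from the image border is covered by many blocks when the block advancement is small, the resulting mask boundary follows the per-pixel vote ratio rather than the block grid, which is one way to understand why a smaller advancement gives smoother contours.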

Experiments conducted to evaluate the standalone per- 
formance (using the difficult Wallflower and I2R datasets) 
show the proposed method obtains on average better results 
(both qualitatively and quantitatively) than methods based on 
GMMs, feature histograms, normalised vector distances, self 
organising maps and stochastic approximation. 

We furthermore proposed the use of tracking performance 
as an unbiased approach for assessing the practical usefulness 
of foreground segmentation methods, and demonstrated that 
the proposed method leads to considerable improvements in 
object tracking accuracy on the CAVIAR dataset. 